Theoretical Background

Self-organizing Data Mining


Rapid development of information technology, continuing computerization in almost every field of human activity, and distributed computing have delivered a flood of data that is now stored in databases or data warehouses. In the 1960s, Management Information Systems (MIS) and then, in the 1970s, Decision Support Systems (DSS) were praised for their potential to supply executives with the data needed to carry out their job responsibilities. While these systems have indeed supplied some useful information for executives, they have not lived up to their proponents' expectations. It has been shown that there is no lack of data, but rather a lack of information about an object's behavior when making complex decisions.

Today, there is an increased need to efficiently discover information from large collections of data - contextual, non-obvious, and valuable for decision making. This interactive and iterative process of various subtasks and decisions is called Knowledge Discovery from Data. The engine of Knowledge Discovery - where data is transformed into knowledge for decision making - is Data Mining.

There are very different data mining tools available, and many papers have been published describing data mining techniques. We think that the priority for advanced data mining is to limit the involvement of users in the entire data mining process to the inclusion of well-known a priori knowledge exclusively, while making this process more automated and more objective. Most users are primarily interested in the model results themselves and have neither extensive knowledge of mathematical, cybernetic, and statistical techniques nor sufficient time for dialog-driven modeling tools. Soft computing, i.e., Fuzzy Modeling, Neural Networks, Genetic Algorithms, and other methods of automatic model generation, is a way to mine data by generating mathematical models from empirical data more or less automatically.

In recent years there has been much publicity about the ability of Artificial Neural Networks to learn and to generalize, despite important problems with the design, development, and application of Neural Networks:

  • Neural Networks have no explanatory power by default, i.e., they cannot describe why results are as they are. The knowledge (models) extracted by Neural Networks remains hidden and distributed over the network.
  • There is no systematic approach for designing and developing Neural Networks; it is a trial-and-error process.
  • Training of Neural Networks is a kind of statistical estimation, often using algorithms that are slower and less effective than those used in statistical software.
  • If the noise in a data sample is considerable, the generated models systematically tend to be overfitted.

In contrast to traditional Neural Networks, which use

  • Genetic Algorithms as an external procedure to optimize the network architecture and
  • several pruning techniques to counteract overtraining,

KnowledgeMiner employs principles of evolution - inheritance, mutation, and selection - to generate a network structure systematically, enabling combined automatic model structure synthesis and model validation. Models are generated adaptively from data in the form of networks of active neurons, in an evolutionary fashion of repeatedly generating populations of competing models of growing complexity, validating them, and selecting among them until an optimally complex model - not too simple and not too complex - has been created. That is, a tree-like network is grown out of seed information (the input and output variables' data) in an evolutionary fashion of pairwise combination and survival-of-the-fittest selection, from a single simple individual (neuron) to a desired final, not overspecialized behavior (model). Neither the number of neurons and layers in the network nor the actual behavior of each created neuron is predefined. All of this is adjusted during the process of self-organization, which is why the approach is called self-organizing data mining.
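To make this growth-and-selection loop concrete, here is a minimal sketch in Python of a GMDH-style multilayer algorithm: pairs of inputs feed simple quadratic neurons fitted by least squares on training data, candidates are judged by an external (regularity) criterion on separate validation data, and the best survivors become the inputs of the next layer until validation error stops improving. It assumes NumPy; all function and parameter names are illustrative and are not taken from KnowledgeMiner.

    import numpy as np
    from itertools import combinations

    def fit_neuron(xi, xj, y):
        # Least-squares fit of a quadratic "active neuron" z = f(xi, xj).
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

    def eval_neuron(coef, xi, xj):
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        return A @ coef

    def gmdh_sketch(X_train, y_train, X_valid, y_valid, width=8, max_layers=10):
        # Grow layers of pairwise-combined neurons; only the fittest survive.
        train, valid = X_train, X_valid
        best_err = np.inf
        for layer in range(max_layers):
            candidates = []
            for i, j in combinations(range(train.shape[1]), 2):
                coef = fit_neuron(train[:, i], train[:, j], y_train)
                pred = eval_neuron(coef, valid[:, i], valid[:, j])
                err = float(np.mean((pred - y_valid) ** 2))  # external (regularity) criterion
                candidates.append((err, i, j, coef))
            candidates.sort(key=lambda c: c[0])
            survivors = candidates[:width]
            # Stop when the best neuron of the new layer no longer improves on
            # the validation data: the optimally complex model has been reached.
            if survivors[0][0] >= best_err:
                break
            best_err = survivors[0][0]
            # Outputs of the surviving neurons become the inputs of the next layer.
            train = np.column_stack([eval_neuron(c, train[:, i], train[:, j])
                                     for _, i, j, c in survivors])
            valid = np.column_stack([eval_neuron(c, valid[:, i], valid[:, j])
                                     for _, i, j, c in survivors])
        return best_err  # a full implementation would also return the surviving network

The sketch only tracks the best validation error; a usable implementation would retain the chain of surviving neurons so that the final model can be applied to new data and written out as an explicit polynomial.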

What distinguishes this new approach from traditional Neural Networks is its focus on Statistical Learning Networks and induction. The first Statistical Learning Network algorithm of this type, the Group Method of Data Handling (GMDH), was developed by A.G. Ivakhnenko in 1967. Considerable improvements were introduced in the 1970s and 1980s by versions of the Polynomial Network Training algorithm (PNETTR) by Barron and the Algorithm for Synthesis of Polynomial Networks (ASPN) by Elder, as Adaptive Learning Networks and GMDH flowed together. Further enhancements of the GMDH algorithm have been realized in KnowledgeMiner.


Why Data Mining is needed


Decision making in every field of human activity requires problem detection in addition to a decision maker's feeling that a problem exists or that something is wrong. Models are the basis for every decision. It is worth building models to aid decision making for the following reasons.

Models make it possible:

  • to recognize the structure and function of complicated objects (subject of identification), which leads to a deeper understanding of the problem; models can usually be analyzed more readily than the original problem;
  • to find appropriate means for exercising an active influence on the objects (subject of control);
  • to predict what the respective objects have to expect in the future (subject of prediction), but also to experiment with models and thus answer "what-if" questions.

Therefore, mathematical modeling forms the core of almost all decision support systems.

Models can be derived from existing theory (theory-driven approach or theoretical systems analysis) and/or from data (data-driven approach or experimental systems analysis).

a. theory-driven approach

For complex, ill-defined systems, such as economic, ecological, social, or biological systems, we have insufficient a priori knowledge about the relevant theory of the system under research. Modeling based on a theory-driven approach is considerably hampered by the fact that the modeler often has to know things about the system that are generally impossible to find out. This concerns uncertain a priori information with regard to the selection of the model structure (factors of influence and functional relations) as well as insufficient knowledge about interference factors (actual interference factors and factors of influence which cannot be measured). Accordingly, insufficient a priori information concerns the required knowledge about the object under research with respect to:

  • the main factors of influence (endogenous variables or input variables) and also the classification of variables as endogenous and exogenous;
  • the functional form of the relation between the variables including the dynamic specification of the model;
  • the description of errors such as their correlation structure.

In order to overcome these problems and to deal with ill-defined systems and, in particular, with insufficient a priori knowledge, ways must be found - with the help of emergent information engineering - to shorten the time- and resource-intensive model formation process required before initial task solving can start. Computer-aided design of mathematical models may soon prove highly valuable in bridging this gap.

b. data-driven approach

Modern information technologies deliver a flood of data, and the question is how to leverage it. Commonly, statistically based principles are used for model formation, but they always require a priori knowledge about the structure of the mathematical model.
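As a small illustration of this requirement - a hedged sketch assuming NumPy, with synthetic data and illustrative variable names - classical regression estimates only the coefficients of a structure the modeler has fixed in advance:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, 200)
    y = 2.0 + 3.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.1, 200)  # synthetic data

    # The structure y = a0 + a1*x + a2*x^2 is assumed a priori by the modeler;
    # only the coefficients a0, a1, a2 are estimated from the data. A
    # self-organizing approach would search over candidate structures instead.
    A = np.column_stack([np.ones_like(x), x, x**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coef)  # estimated a0, a1, a2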

In addition to the epistemological problems of commonly used statistical principles of model formation, several methodological problems may arise in conjunction with the insufficiency of a priori information. This indeterminacy of the starting position - marked by the subjectivity and incompleteness of the theoretical knowledge and by an insufficient data basis - leads to the methodological problems described in [Lemke/Müller, 1997].

Knowledge discovery from data, and specifically data mining techniques and tools, can assist humans in analyzing mountains of data and in turning the information located in the data into successful decision making.

Data mining comprises not just a single analytical technique but many methods and techniques, depending on the nature of the inquiry. These methods include data visualization, tree-based methods, and methods of mathematical statistics, as well as methods for knowledge extraction from data using self-organizing modeling.

Data mining is an interactive and iterative process of numerous subtasks and decisions, such as data selection and pre-processing, choice and application of data mining algorithms, and analysis of the extracted knowledge. Most important for a more sophisticated data mining application is to limit the involvement of users in the overall data mining process to the inclusion of existing a priori knowledge, while making the process more automated and more objective.
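These subtasks can be pictured as a short pipeline. The following hedged sketch, again in Python with NumPy, reuses the gmdh_sketch() function from the earlier example; the file name, column layout, and train/validation split are illustrative assumptions only.

    import numpy as np

    # 1. Data selection: load observations (hypothetical file and layout:
    #    candidate input variables in all columns but the last, output last).
    data = np.loadtxt("observations.csv", delimiter=",", skiprows=1)
    X, y = data[:, :-1], data[:, -1]

    # 2. Pre-processing: scale the candidate inputs.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # 3. Split into training data (parameter estimation) and validation data
    #    (external selection criterion), as self-organizing modeling requires.
    n_train = int(0.7 * len(y))
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_va, y_va = X[n_train:], y[n_train:]

    # 4. Apply the data mining algorithm and analyze the extracted knowledge.
    err = gmdh_sketch(X_tr, y_tr, X_va, y_va)
    print("validation MSE of the selected model:", err)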

Automatic model generation methods such as GMDH, Analog Complexing, and GMDH-based Fuzzy Rule Induction are based on these demands and sometimes provide the only way to generate models of ill-defined problems.



